Abstract

Low-frequency words pose a major challenge for automatic
speech recognition (ASR). The probabilities of these words,
which are often important named entities, are generally
under-estimated by the language model (LM) due to their limited
occurrences in the training data. Recently, we proposed a
word-pair approach to deal with the problem, which borrows
information from frequent words to enhance the probabilities of
low-frequency words. This paper presents an extension to the
word-pair method that involves multiple ‘predicting words’ to
produce better estimates for low-frequency words. We also
employ this approach to deal with out-of-language words in the
task of multi-lingual speech recognition.

Index Terms: speech recognition, language model, multilingual

1. Introduction 

The language model (LM) is an important module in automatic
speech recognition (ASR). The best-known language modelling
approach is based on word n-grams, which relies on statistics of
n-gram counts to predict the probability of a word given its
past n-1 words. In spite of its wide usage, the n-gram LM has an
obvious limitation in estimating the probabilities of words that
have low frequencies and of words that are absent from the
training data. For low-frequency words, the probabilities tend
to be under-estimated due to the lack of occurrences of their
n-grams in the training data; for words that are absent in
training, estimating the probabilities is simply impossible.
Ironically, these words are often important entity names that
should be emphasized in decoding, which means that the
probability under-estimation for them is a serious problem for
ASR systems in practical usage.

A well-known approach to dealing with low-frequency and
absent words is to apply smoothing techniques such as back-off
[2] and discounting [3]. This approach allocates a small
proportion of probability mass to low-frequency and absent words
so that they can be recognized. However, the allocated
probabilities for these words are very small, which makes them
unlikely to be recognized correctly unless the acoustic evidence
is fairly strong. Besides, this approach does not support
flexible enhancement for words that are important for a
particular domain or application.

Another well-known approach is to train an LM with some
structures that can be dynamically changed, e.g., the class-based
LM with classes that are adaptable online [4]. These dynamic
structures, however, need to be pre-defined and cannot handle
words that are not in the structure. For example, words that are
not in the pre-defined classes cannot be handled by class-based
LMs. Additionally, involving such dynamic structures often
requires modifying the decoder, which in our opinion is not ideal.

Recently, we proposed a similar-pair approach to deal with
the problem [5]. The basic idea is to borrow some information
from high-frequency words to enhance low-frequency words.
More specifically, we look for a high-frequency word that is
similar to the word to enhance, and then re-weight the
probability of the low-frequency word by adding a proportion of
the probability of the high-frequency word to the probability
of the low-frequency word. This approach has been implemented
with the LM FST graph [6]. Compared to the traditional
class-based LM approach, the new approach is flexible enough to
enhance any word and does not need to change the decoder. It has
been shown that this approach can provide significant
performance gains for low-frequency words and for words that are
totally absent from the training data.

This paper is a follow-up to [5]. We first present an
extension that allows multiple high-frequency words
(‘indicating words’) to be used when enhancing a low-frequency
word. This extension helps to involve multi-source information
in the word enhancement, and is particularly important for words
with multiple senses. Secondly, the similar-pair approach is
applied to a particular kind of low-frequency word:
out-of-language (OOL) words that come from another language but
are embedded in utterances of the host language, for example
English words appearing in Chinese utterances. These words are
totally new to the host language and no context information can
be employed to estimate their probabilities. The similar-pair
approach can deal with this situation by assuming that words
in different languages share the same semantic space and hence
that similar pairs can cross languages. The experimental
results in Section 5 demonstrate the capability of this approach
in dealing with OOL words.

The remainder of this paper is structured as follows.
Section 2 discusses related work, and the similar-pair method is
described in Section 3. The two new extensions are presented
in Section 4, which is followed by Section 5 where the
experiments are presented. Finally, the paper is concluded in
Section 6.

2. Related works 

This work is related to dynamic language modeling that adds
new words and re-weights word probabilities, particularly the
approaches that are based on FSTs. This section reviews some
typical techniques of this kind, and primarily focuses on
the class-based LM that deals with dynamic vocabularies and
low-frequency words.

Class-based language modeling [?] is an approach that
clusters similar words into classes and re-distributes the
probabilities of the words in each class, for instance according
to their unigram statistics. Typically, the class-based LM
delivers better representations than the word-based LM for
low-frequency words [?], since the class-based structure
factorizes probabilities of low-frequency words into class
probabilities and class member probabilities, and so increases
the robustness of the probability estimation. Moreover, new
words can be easily added into classes with the class-based LM,
leading to a dynamic vocabulary. Additionally, [?] and [?]
introduced two techniques to build both the class-based LMs and
the class words into FST graphs and to embed class FSTs into the
class-based LM FST. This embedding can be done on-the-fly, and
thus offers flexible dynamic decoding that supports instant
introduction of new words. Similar approaches have been proposed
in [?, ?, ?], where various dynamic embedding methods are
introduced, and the classes are extended to complex grammars.

This work is an extension of the similar-pair method
proposed in [5]. In this approach, the probabilities of
low-frequency words are enhanced and new words are supported
by adding new FST transitions, both with reference to the
transitions of the similar high-frequency words. Compared to the
other approaches mentioned above, this method is more flexible,
as it supports any word rather than only words in some
pre-defined classes.

The extensions we made in this paper to the work in [5] are
two-fold: first, the similar-pair algorithm is extended to allow
multiple predicting words, which brings in multiple sources of
information and thus better enhancement; second, the
similar-pair approach is employed to deal with OOL words, which
demonstrates that similar pairs can be cross-lingual.

3. Word enhancement by similar-pairs 

In this section we first give a brief introduction to the FST-based
ASR architecture, and then present the similar-pair method
implemented on FSTs.

3.1. Finite state transducer 

A Finite State Transducer (FST) is essentially a Finite State
Automaton (FSA) that produces output as well as reading input.
It is represented as a graph where every node indicates a state
and every arc that links two nodes is assigned an input and an
output symbol. Each transition and each final state is labeled
with a weight. An FST example is depicted in Fig. 1.
In this example, the initial state is state 0, and the final
state is state 2. A weight of 3.5 has been assigned to the final
state. Let (s,t,i:o/w) denote a transition, where s and t are the
entry and exit states respectively, i is the input symbol,
o is the output symbol, and w is the associated transition
weight. From the initial state 0 to state 1, there are two
transitions (0,1,a:x/0.5) and (0,1,b:y/1.5). From state 1 to the
final state 2, there is only one transition (1,2,c:z/2.5). An FST
can accept a sequence of input symbols and generate a sequence
of corresponding output symbols. For instance, in Fig. 1, given
an input string ‘ac’, the transition (0,1,a:x/0.5) accepts the first
character ‘a’ and generates an output ‘x’ with weight 0.5, and
the transition (1,2,c:z/2.5) accepts the second character ‘c’ and
generates an output ‘z’ with weight 2.5. The weight of the
transition path is computed as the sum of the weights associated
with each transition in the path, plus the weight associated with
the final state. In our example, the weight of the transition path
that accepts ‘ac’ is 0.5 + 2.5 + 3.5 = 6.5.
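The path-weight computation above can be checked with a small sketch. The code below encodes the toy FST of Fig. 1 as a plain list of arcs; the names `ARCS`, `FINAL_WEIGHTS` and `transduce` are illustrative choices and not part of any toolkit. It assumes, as in the example, that at most one arc matches each input symbol.

```python
from typing import Dict, List, Optional, Tuple

# One arc: (source state, target state, input symbol, output symbol, weight)
Arc = Tuple[int, int, str, str, float]

# The toy FST of Fig. 1
ARCS: List[Arc] = [
    (0, 1, "a", "x", 0.5),
    (0, 1, "b", "y", 1.5),
    (1, 2, "c", "z", 2.5),
]
INITIAL_STATE = 0
FINAL_WEIGHTS: Dict[int, float] = {2: 3.5}  # final state -> final weight


def transduce(inputs: str) -> Optional[Tuple[str, float]]:
    """Return (output string, path weight), or None if the input is rejected."""
    state = INITIAL_STATE
    outputs: List[str] = []
    total = 0.0
    for symbol in inputs:
        # The toy FST is deterministic, so the first match is the only one.
        matches = [arc for arc in ARCS if arc[0] == state and arc[2] == symbol]
        if not matches:
            return None
        _, state, _, out_symbol, weight = matches[0]
        outputs.append(out_symbol)
        total += weight
    if state not in FINAL_WEIGHTS:
        return None  # the path did not end in a final state
    # Path weight = sum of arc weights + weight of the final state
    return "".join(outputs), total + FINAL_WEIGHTS[state]


print(transduce("ac"))  # ('xz', 6.5)
```

Inputs that do not end in the final state, such as ‘a’ alone, are rejected by this sketch.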



Figure 1: An FST example. 


3.2. FST-based speech recognition 

Most current large-vocabulary speech recognition systems
are based on statistical models, such as hidden Markov models
(HMMs), lexicons, decision trees and n-gram LMs. All these
models can be converted into FSTs. For an FST, the
correlation between the input and output symbols represents
the mapping from a low-level sequence (e.g., phones) to
a high-level sequence (e.g., words), and the weights encode
the probability distribution of the mapping. More importantly,
FSTs that represent different levels of statistical models
can be composed to form a unified mapping function that
associates primary inputs with high-level outputs. The
composed FST can be further optimized by standard FST
operations, including determinization, minimization and weight
pushing. This produces very compact and efficient graphs that
represent the knowledge of all the statistical models involved in
the composition. In speech recognition, the composition can be
used to produce a very efficient graph that maps HMM states to
word sequences. The graph building process can be represented
as follows:

$$HCLG = \min(\det(H \circ C \circ L \circ G)) \qquad (1)$$

where H, C, L and G represent the HMM, the decision tree, the
lexicon and the LM (or the grammar in grammar-based recognition)
respectively, and $\circ$, ‘det’ and ‘min’ denote the FST
operations of composition, determinization and minimization
respectively.
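To make the composition operator concrete, the following is a minimal sketch of weighted composition for two toy transducers. It is deliberately simplified: it assumes epsilon-free transducers, adds weights along matched arcs, and ignores final weights, whereas real H, C, L and G graphs are built and optimized with an FST toolkit. The function name `compose` and the second transducer are hypothetical and chosen only for illustration.

```python
from typing import Any, List, Tuple

# One arc: (source state, target state, input symbol, output symbol, weight)
Arc = Tuple[Any, Any, str, str, float]


def compose(arcs_a: List[Arc], arcs_b: List[Arc],
            start_a: Any, start_b: Any) -> Tuple[Any, List[Arc]]:
    """Compose A and B: an arc of A is paired with an arc of B whenever
    A's output symbol equals B's input symbol; the weights are added."""
    start = (start_a, start_b)
    composed: List[Arc] = []
    visited = {start}
    stack = [start]
    while stack:
        state_a, state_b = stack.pop()
        for a_src, a_dst, a_in, a_out, a_w in arcs_a:
            if a_src != state_a:
                continue
            for b_src, b_dst, b_in, b_out, b_w in arcs_b:
                if b_src != state_b or b_in != a_out:
                    continue
                target = (a_dst, b_dst)
                composed.append(((state_a, state_b), target, a_in, b_out, a_w + b_w))
                if target not in visited:
                    visited.add(target)
                    stack.append(target)
    return start, composed


# A: the toy FST of Fig. 1. B: a hypothetical second-level transducer
# that accepts only 'x' and 'z'.
A = [(0, 1, "a", "x", 0.5), (0, 1, "b", "y", 1.5), (1, 2, "c", "z", 2.5)]
B = [(0, 0, "x", "X", 1.0), (0, 0, "z", "Z", 2.0)]
start, arcs = compose(A, B, 0, 0)
for arc in arcs:
    print(arc)
```

In this example the arc labeled b:y of A has no counterpart in B, so it does not survive the composition; only paths accepted by both levels remain.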

3.3. Low-frequency word enhancement with similar pairs 

The similar-pair method is based on the FST architecture. The
enhancement is conducted on the LM FST, i.e., the G graph. To
enhance low-frequency words, a list of manually defined similar
pairs is provided, together with the corresponding frequency
information obtained from the training data. The low-frequency
words are selected to be enhanced and the high-frequency words
are chosen to provide the enhancement information. Each similar
pair in the list includes one high-frequency word and some
low-frequency words. Given a set of similar words, the
low-frequency words are enhanced by looking at the information
of the high-frequency word, including its transitions in the G
graph and the associated weights. The high-frequency words
are left unchanged since they are already well represented by
the n-gram model.

4. The Method 

The extension of the similar-pair method is introduced in this
section. As in the original method, the probabilities of
low-frequency or new words are enhanced by looking at
the information of high-frequency words. Given a set of words
$W = \{x_1, x_2, \ldots, x_m\}$ to be enhanced, for each word
$x_i \in W$, a set of words $S_i = \{y_1, y_2, \ldots, y_n\}$
that are similar to $x_i$ is manually selected from the training
data. The similarity can be defined in terms of either syntactic
roles or semantic meanings. We assume that, for each
$y_j \in S_i$, if there exists an n-gram of $y_j$ in the training
corpora, the corresponding n-gram of $x_i$ should also have a
relatively high probability of appearance. As the probabilities
are represented as weights in the G graph, according to this
assumption, the new weight (probability) of $x_i$ can be computed
by equation (2):

$$w_{x_i} = w_{y_j} + \ln\left(\frac{f_{x_i}}{f_{x_i} + f_{y_j}}\right) + \theta \qquad (2)$$

where $\theta$ is a parameter that tunes the enhancement scale,
and $f_{x_i}$ and $f_{y_j}$ are the frequencies of $x_i$ and
$y_j$ in the training data. Note that according to (2), a larger
$f_{x_i}$ leads to a higher $w_{x_i}$, which means that a more
frequent word (still low-frequency) is assigned a larger weight
after enhancement, and so the rank of the low-frequency words in
terms of probability is preserved. However, if the word $x_i$ is
a new word, the logarithm term can be ignored. The FST can then
be updated with the new weight. Let $A(y_j)$ denote the set of
all the transitions of the word $y_j$. $A(y_j)$ can be retrieved
by searching through the G graph. For each transition
$(s, t, y_j : y_j / w_{y_j}) \in A(y_j)$ in the G graph, we check
whether a transition $(s, t, x_i : x_i / w_{x_i})$ exists in G
for $x_i$. If it exists, the weight $w_{x_i}$ is adjusted to a
new weight $w'_{x_i}$; otherwise, a parallel arc with transition
$(s, t, x_i : x_i / w'_{x_i})$ is added. The new weight
$w'_{x_i}$ is calculated by equation (2).
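The update procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the G graph is reduced to a dictionary keyed by (source state, target state, word), weights are treated as log-probabilities as in the text (toolkits that store negated log-probabilities would need the signs flipped), and the names `enhanced_weight` and `enhance` are hypothetical. The text does not specify what happens when several predicting words propose the same arc for $x_i$; the sketch assumes that the largest proposed weight is kept.

```python
import math
from typing import Dict, List, Tuple

# Simplified G graph: (source state, target state, word) -> weight.
# Weights are treated as log-probabilities (higher means more likely).
Graph = Dict[Tuple[int, int, str], float]


def enhanced_weight(w_y: float, f_x: int, f_y: int, theta: float) -> float:
    """Equation (2). For a new word (f_x == 0) the logarithm term is ignored."""
    if f_x == 0:
        return w_y + theta
    return w_y + math.log(f_x / (f_x + f_y)) + theta


def enhance(graph: Graph, x: str, similar: List[str],
            freq: Dict[str, int], theta: float) -> Graph:
    """Return a copy of `graph` in which word x borrows the transitions
    of every predicting word in `similar`."""
    proposals: Dict[Tuple[int, int], float] = {}
    for (s, t, word), w_y in graph.items():
        if word not in similar:
            continue
        w_new = enhanced_weight(w_y, freq.get(x, 0), freq[word], theta)
        # Assumption (not stated in the text): when several predicting
        # words propose the same arc, keep the largest weight.
        proposals[(s, t)] = max(proposals.get((s, t), -math.inf), w_new)
    enhanced = dict(graph)
    for (s, t), w_new in proposals.items():
        # Overwrites the weight if the arc for x exists, otherwise adds
        # a parallel arc between the same two states.
        enhanced[(s, t, x)] = w_new
    return enhanced


# Toy graph: 'a' and 'b' are high-frequency words, 'c' is a new word.
g = {(0, 1, "a"): -1.0, (0, 1, "b"): -2.0, (1, 2, "b"): -0.5}
print(enhance(g, "c", ["a", "b"], {"a": 1000, "b": 500}, theta=-2.0))
```

With a single predicting word the sketch reduces to the original similar-pair update; with several, the new word gains arcs wherever any of its predicting words has one.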

An example of the enhancement process is illustrated in
Fig. 2, where Fig. 2(a) shows the G graph before the
enhancement, and Fig. 2(b) shows the G graph after the
enhancement. Note that ‘a’ is the high-frequency word, and (a,c)
forms a similar pair. A new transition has been added in
Fig. 2(b) for the low-frequency word ‘c’.

Compared with the original similar-pair method, in this
extension each low-frequency or new word $x_i$ is
enhanced by multiple high-frequency words rather than only
one. This enlarges the search space of the low-frequency
word in the G graph. Since the corresponding high-frequency
referral words in $S_i$ are selected to be similar to $x_i$, the
paths of $x_i$ that are added to G are most likely to be
grammatically correct and semantically reasonable.






Figure 2: An example of a low-frequency word enhancement 
based on similar pairs. (a,c) is a similar pair, where ‘a’ is the 
high-frequency word, and ‘c’ is a low-frequency word. A new 
transition is added in (b). 


In addition, multiple referral words provide more complete
information, especially in the multi-lingual situation, since
a one-to-one correspondence between words exists only on rare
occasions. A word in one language may correspond to multiple
variants in another language. This has been verified by our
experimental results in Section 5.

5. Experiment 

A bilingual ASR task in the telecom domain is chosen to
evaluate the proposed approach. We first introduce the
experimental configurations, and then present the performance of
the proposed similar-pair-based enhancement of low-frequency
English words.

5.1. Database 

Our ASR task aims to transcribe conversations recorded from
online service calls. The domain is telecom service and the
language is Chinese or English. The acoustic model (AM) is
trained on 1400 hours of manually transcribed online speech
recordings from a large call center service provider. The
Chinese LM is trained on a corpus including the transcriptions of
the AM training speech and some logs of web-based customer
service systems in the telecom service domain.

In total, 22 similar pairs are selected to evaluate the
performance of the similar-pair method. Each similar pair
contains 1 to 5 high-frequency Chinese words and 1 to 4 new
English words. A ‘FOREIGN’ test set was deliberately designed to
test the enhancement with these similar pairs; it consists of 42
sentences from the online speech recordings. Each transcription
contains some of the new English words that appear in the
similar pairs.

Additionally, a ‘GENERAL’ test set that involves 2608
utterances is selected as a control group to test the
generalizability of the proposed method. Each utterance in this
set contains words of various frequencies, and therefore it can
be used to examine whether the proposed method impacts the
general performance of ASR systems while enhancing low-frequency
and new words.

5.2. Acoustic model training 

The ASR system is based on the state-of-the-art HMM-DNN
acoustic modeling approach, which represents the dynamic
properties of speech signals using the hidden Markov model
(HMM), and represents the state-dependent signal distribution
by the deep neural network (DNN) model. The features are
40-dimensional FBank power spectra. An 11-frame splice
window is used to concatenate neighboring frames to capture
the long temporal dependency of speech signals. Linear
discriminant analysis (LDA) is applied to reduce the dimension
of the concatenated feature to 200.
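The splicing step can be illustrated with a short sketch. It assumes a symmetric window of 5 frames on each side of the current frame (11 frames in total) and that the first and last frames are repeated at the utterance boundaries; neither detail is stated in the text. The function name `splice` is illustrative. With 40-dimensional input frames, each spliced vector has 11 x 40 = 440 dimensions, which LDA then reduces to 200 (the LDA projection itself needs a trained matrix and is not shown).

```python
from typing import List


def splice(frames: List[List[float]], context: int = 5) -> List[List[float]]:
    """Concatenate every frame with `context` neighbours on each side."""
    num_frames = len(frames)
    spliced: List[List[float]] = []
    for t in range(num_frames):
        window: List[float] = []
        for offset in range(-context, context + 1):
            # Assumption: repeat the edge frame when the window runs
            # past the start or the end of the utterance.
            index = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


# 20 dummy frames of 40 dimensions; frame t is filled with the value t.
frames = [[float(t)] * 40 for t in range(20)]
spliced = splice(frames)
print(len(spliced), len(spliced[0]))  # 20 440
```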

The Kaldi toolkit [13] is used to train the HMM and DNN
models. The training process largely follows the WSJ s5 GPU
recipe published with Kaldi. Specifically, a pre-DNN system
is first constructed based on Gaussian mixture models (GMM),
and this system is then used to produce phone alignments of the
training data. The alignments are employed to train the
DNN-based system.

5.3. Language model training 

The training text is normalized before training. The
normalization includes removing unrecognized characters,
unifying different encoding schemes and normalizing the spelling
form of numbers and letters. Then the training text is segmented
into word sequences. A word segmentation tool provided by Google
is used in this study. In total, 150,000 words are selected
as the LM lexicon, according to the word frequency in
the segmented training text. The SRILM toolkit¹ is then used
to train a 3-gram LM, which is smoothed by Kneser-Ney
discounting. The Kaldi toolkit is used to convert n-gram LMs to G
graphs, and the openFST toolkit² is used to manipulate FSTs.
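The frequency-based lexicon selection can be sketched in a few lines. This is only an illustration of the idea, assuming the training text has already been normalized and segmented into lists of words; the function name `select_lexicon` and the toy sentences are hypothetical, and the actual selection procedure may differ in details such as tie-breaking.

```python
from collections import Counter
from typing import Iterable, List


def select_lexicon(segmented_sentences: Iterable[List[str]], size: int) -> List[str]:
    """Return the `size` most frequent words of the segmented text."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return [word for word, _ in counts.most_common(size)]


# Toy example; in the experiments the lexicon size is 150,000.
sentences = [["a", "b", "a"], ["c", "a", "b"]]
print(select_lexicon(sentences, 2))  # ['a', 'b']
```

Words that fall outside the selected lexicon are exactly the kind of absent words that the enhancement method is designed to support.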

5.4. Experiment result and analysis 

The ASR performance in terms of the word error rate (WER)
is presented in Table 1, Table 2 and Table 3. We report the
results on two test sets, ‘GENERAL’ and ‘FOREIGN’, with
different values of the enhancement scale $\theta$ and different
numbers of high-frequency Chinese words. It can be seen that
with the similar-pair-based enhancement, the ASR performance on
utterances with new English words is significantly improved. In
addition, compared with using one high-frequency Chinese word,
using multiple high-frequency Chinese words gives better results
for negative $\theta$; for example, at $\theta = -4$ the WER on
‘FOREIGN’ falls from 66.95 with one word to 60.4 with three
words. Interestingly, the enhancement of these infrequent words
does not degrade the recognition of the other Chinese words: the
performance on the ‘GENERAL’ test set remains nearly unchanged
(between 33.76 and 33.97 across all settings in Table 2), which
indicates that the proposed approach does not impact the general
performance of ASR systems, and thus is safe to employ. For a
clearer presentation, the trends of WERs on the two test sets
with different values of $\theta$ and different numbers of
high-frequency Chinese words are presented in Fig. 3.
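For reference, the WER used in the tables is conventionally computed from the word-level edit distance between the reference transcription and the recognition output. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the example sentences are illustrative, and the scoring tool actually used in the experiments is not stated in the text.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER (%) = 100 * (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    # previous[j]: edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(reference)


reference = "please check my wifi signal".split()
print(word_error_rate(reference, "please check my wife signal".split()))  # 20.0
```

A single substituted word in a five-word reference gives a WER of 20%, which shows why a few misrecognized English words weigh heavily on the small ‘FOREIGN’ test set.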




WER%
             $\theta$   GENERAL   FOREIGN
Baseline     -          33.75     77X4
+ SP         -4         33.76     66.95
             -2         33.76     64.39
             0          33.77     62.11
             2          33.8      62.96
             4          33.83     69.8

Table 1: WERs with and without the similar-pair-based
enhancement. ‘SP’ stands for enhancement with similar pairs,
which use one high-frequency Chinese word. $\theta$ is the
enhancement scale in equation (2).



WER%
$\theta$ \ ChNum    1       2       3       4       5
-4                  33.76   33.77   33.77   33.78   33.78
-2                  33.76   33.77   33.77   33.77   33.78
0                   33.77   33.79   33.79   33.78   33.79
2                   33.8    33.83   33.82   33.81   33.82
4                   33.83   33.96   33.96   33.97   33.96


Table 2: WERs with different numbers of high-frequency
Chinese words on the ‘GENERAL’ test set. $\theta$ is the
enhancement scale in equation (2); ChNum is the number of
high-frequency Chinese words.


1 http://www.speech.sri.com/projects/srilm/ 
2 http://www.openfst.org 

































WER%
$\theta$ \ ChNum    1       2       3       4       5
-4                  66.95   62.68   60.4    60.68   61.82
-2                  64.39   63.53   61.54   61.54   62.68
0                   62.11   64.96   66.95   66.95   65.53
2                   62.96   66.95   65.53   65.53   65.53
4                   69.8    73.5    77.49   77.49   77.78


Table 3: WERs with different numbers of high-frequency
Chinese words on the ‘FOREIGN’ test set. $\theta$ is the
enhancement scale in equation (2); ChNum is the number of
high-frequency Chinese words.


Figure 3: WERs on the two test sets with increasing ChNum.
The value of $\theta$ is -4.


To further examine the gains offered by the proposed
approach, the name entity error rate (NEER) is used. In contrast
to the WER, which measures the accuracy on all words, the NEER
evaluates the accuracy on focused words, i.e., the words to
enhance (the new English words) and the other Chinese words in
the ‘GENERAL’ test set. The results are presented in Table 4,
Table 5 and Table 6. It can be seen that the similar-pair-based
enhancement does deliver a much better accuracy on the new
English words. Importantly, the improvement on the infrequent
words does not impact the performance on the other Chinese
words in the ‘GENERAL’ test set, which confirms the
effectiveness and safety of the proposed method. Again, the
trends of NEERs with different values of $\theta$ and different
numbers of high-frequency Chinese words are presented
in Fig. 4.
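The text does not give a formula for the NEER. One plausible reading, sketched below purely as an assumption, is the percentage of focused-word occurrences in the reference that are not recognized correctly. The sketch aligns reference and hypothesis with Python's `difflib.SequenceMatcher`, which approximates but is not identical to a minimum-edit-distance alignment; the function name `focused_error_rate` and the romanized example tokens are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Set


def focused_error_rate(reference: List[str], hypothesis: List[str],
                       focus: Set[str]) -> float:
    """Percentage of focused reference words that are not matched in the
    hypothesis (an assumed reading of the NEER, not the paper's definition)."""
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    matched_positions = set()
    for block in matcher.get_matching_blocks():
        matched_positions.update(range(block.a, block.a + block.size))
    focused_positions = [i for i, word in enumerate(reference) if word in focus]
    if not focused_positions:
        raise ValueError("reference contains no focused words")
    errors = sum(1 for i in focused_positions if i not in matched_positions)
    return 100.0 * errors / len(focused_positions)


# Hypothetical utterance with one embedded English word.
reference = ["wo", "yao", "banli", "wifi", "taocan"]
hypothesis = ["wo", "yao", "banli", "waimai", "taocan"]
print(focused_error_rate(reference, hypothesis, {"wifi"}))
```

Restricting the focus set to the new English words or to the remaining Chinese words gives the two kinds of figures reported separately in the tables.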




NEER%
             $\theta$   CHINESE   ENGLISH
Baseline     -          46.85     100
+ SP         -4         48.65     72.09
             -2         50.45     48.84
             0          50.45     32.56
             2          54.05     0
             4          59.46     0


Table 4: NEERs with and without the similar-pair-based
enhancement. ‘SP’ stands for enhancement with similar pairs,
which use one high-frequency Chinese word. $\theta$ is the
enhancement scale in equation (2).


NEER%
$\theta$ \ ChNum    1       2       3       4       5
-4                  48.65   49.55   48.65   48.65   48.65
-2                  50.45   49.55   51.35   51.35   51.35
0                   50.45   51.35   56.76   56.76   56.76
2                   54.05   58.56   57.66   57.66   58.56
4                   59.46   63.96   63.06   62.16   62.16

Table 5: NEERs with different numbers of high-frequency
Chinese words on the Chinese words of the ‘FOREIGN’ test set.
$\theta$ is the enhancement scale in equation (2); ChNum is the
number of high-frequency Chinese words.


NEER%
$\theta$ \ ChNum    1       2       3       4       5
-4                  72.09   58.13   53.49   51.16   48.84
-2                  48.84   30.23   34.88   32.56   32.56
0                   32.56   20.93   13.95   13.95   9.3
2                   0       0       0       0       0
4                   0       0       0       0       0

Table 6: NEERs with different numbers of high-frequency
Chinese words on the low-frequency English words in the
‘FOREIGN’ test set. $\theta$ is the enhancement scale in
equation (2); ChNum is the number of high-frequency Chinese
words.


Figure 4: NEERs on the two types of words (low-frequency
English words and other Chinese words in the ‘FOREIGN’ test set)
with increasing ChNum. The value of $\theta$ is -4.


6. Conclusion

In this paper, we proposed a similar-pair-based approach to
enhance speech recognition accuracy on low-frequency and new
words. This enhancement is obtained by exploiting the
information of high-frequency words that are similar to the
target words. The experimental results demonstrated that the
proposed method can significantly improve the performance of
speech recognition on low-frequency and new words and does not
impact the ASR performance in general. This makes the method
suitable for quick domain-specific adaptation where
low-frequency words need to be enhanced and new words need to
be supported. Future work involves enhancing low-frequency words
using multiple similar words, and combining this method with
other dynamic LM approaches such as the class-based LM.